Lesson 4


Scatterplots and Perceived Audience Size

Notes:


Scatterplots

Notes:

library(tidyverse)
## ── Attaching packages ───────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.4
## ✔ tidyr   0.7.2     ✔ stringr 1.2.0
## ✔ readr   1.1.1     ✔ forcats 0.2.0
## ── Conflicts ──────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
pf<-read.csv('pseudo_facebook.tsv',sep='\t')

ggplot(data=pf,aes(x=age,y=friend_count))+
  geom_point()


What are some things that you notice right away?

Response: There are vertical striations at specific ages, and younger users tend to have more friends than older users.


ggplot Syntax

Notes:

ggplot(data=pf,aes(x=age,y=friend_count))+
  geom_point()+
  xlim(13,90)
## Warning: Removed 4906 rows containing missing values (geom_point).


Overplotting

Notes:

ggplot(data=pf,aes(x=age,y=friend_count))+
  geom_jitter(alpha=1/20)+
  xlim(13,90)
## Warning: Removed 5173 rows containing missing values (geom_point).

What do you notice in the plot?

Response: The bulk of young users have fewer than 1000 friends.
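
One way to read alpha=1/20 is sketched below. It assumes the graphics device blends overlapping points with standard "over" alpha compositing, which is an assumption rather than something the lesson states. Under that model, n stacked points with transparency alpha have a combined opacity of 1-(1-alpha)^n, so darkness tracks local density. The helper name stacked_opacity is made up for illustration.

```r
# Hypothetical helper: combined opacity of n overlapping points,
# assuming standard "over" alpha compositing.
stacked_opacity <- function(n, alpha = 1/20) {
  1 - (1 - alpha)^n
}

n_points <- c(1, 5, 20, 100)
opacity <- stacked_opacity(n_points)

# Small lookup table of overlap count versus resulting opacity
print(data.frame(n_points, opacity = round(opacity, 3)))
```

Under this model a single point is barely visible, while regions where dozens of users overlap approach solid color.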


coord_trans()

Notes:

friendscatter<-ggplot(data=pf,aes(x=age,y=friend_count))+
  geom_point(alpha=1/20,color="orange")+
  xlim(13,90)+
  coord_trans(y="sqrt")
friendscatter
## Warning: Removed 4906 rows containing missing values (geom_point).

Look up the documentation for coord_trans() and add a layer to the plot that transforms friend_count using the square root function. Create your plot!

What do you notice?
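
The coord_trans() documentation notes that a coordinate transformation is applied after any statistics are computed, whereas a scale transformation such as scale_y_sqrt() is applied before. The toy numbers below (not taken from the Facebook data) sketch why the order matters once a summary such as a mean is involved.

```r
# Toy friend counts (illustrative values only)
friend_count <- c(0, 100, 400)

# Scale-style order: transform first, then summarise
mean_of_sqrt <- mean(sqrt(friend_count))   # mean of 0, 10, 20

# Coord-style order: summarise the raw data, then transform the result
sqrt_of_mean <- sqrt(mean(friend_count))

print(c(mean_of_sqrt = mean_of_sqrt, sqrt_of_mean = sqrt_of_mean))
```

For a plain geom_point() layer no statistic is computed, so the two approaches should place the points identically and differ mainly in how the axis is drawn.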


Alpha and Jitter

Notes: Explore the relationship between friendships initiated and age.

ggplot(data=pf,aes(x=age,y=friendships_initiated))+
  # height=0 jitters ages only, so counts of 0 are never pushed below zero
  geom_point(alpha=1/20,position=position_jitter(height=0))+
  xlim(13,90)+
  coord_trans(y="sqrt")
## Warning: Removed 5176 rows containing missing values (geom_point).
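
Why jitter only horizontally here? The sketch below uses fixed offsets as a stand-in for random jitter (the offsets are invented for illustration). Users with a count of 0 can be pushed below zero by vertical jitter, and the square root of a negative number is NaN, so those points cannot be placed on a sqrt axis.

```r
# Counts for four hypothetical users, two of whom initiated no friendships
friendships_initiated <- c(0, 0, 4, 9)

# Fixed offsets standing in for vertical jitter
vertical_offset <- c(-0.3, 0.2, -0.3, 0.2)
jittered <- friendships_initiated + vertical_offset

# sqrt() of the negative value yields NaN (the warning is silenced here)
transformed <- suppressWarnings(sqrt(jittered))
print(transformed)

# With no vertical jitter every value stays valid
print(sqrt(friendships_initiated))
```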


Overplotting and Domain Knowledge

Notes: The number of friends who see a post is easier to interpret when bounded as a percentage of the user's total friend count.

Conditional Means

Notes:

# age_groups<-group_by(pf,age)
# pf.fc_by_age<-summarise(age_groups,
#           friend_count_mean=mean(friend_count),
#           friend_count_median=median(friend_count),
#           n=n())
# pf.fc_by_age<-arrange(pf.fc_by_age,age)
# 
# head(pf.fc_by_age)

pf.fc_by_age<- pf%>%
  group_by(age)%>%
  summarise(friend_count_mean=mean(friend_count),
            friend_count_median=median(friend_count),
            n=n())%>%
  arrange(age)

head(pf.fc_by_age)
## # A tibble: 6 x 4
##     age friend_count_mean friend_count_median     n
##   <int>             <dbl>               <dbl> <int>
## 1    13               165                74.0   484
## 2    14               251               132    1925
## 3    15               348               161    2618
## 4    16               352               172    3086
## 5    17               350               156    3283
## 6    18               331               162    5196

Create your plot!

friendline<-ggplot(data=pf.fc_by_age,aes(x=age,y=friend_count_mean))+
  geom_line()
friendline
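
As a cross-check on the dplyr pipeline above, the same conditional summaries can be computed in base R. This sketch uses a small made-up data frame (toy_pf is a hypothetical stand-in for pf) so that it runs without the data file.

```r
# Hypothetical miniature of pf: two ages, three users each
toy_pf <- data.frame(
  age          = c(13, 13, 13, 14, 14, 14),
  friend_count = c(10, 20, 90, 100, 200, 300)
)

# Mean, median and group size of friend_count for each age
fc_mean   <- tapply(toy_pf$friend_count, toy_pf$age, mean)
fc_median <- tapply(toy_pf$friend_count, toy_pf$age, median)
n_users   <- tapply(toy_pf$friend_count, toy_pf$age, length)

toy_by_age <- data.frame(
  age                 = as.numeric(names(fc_mean)),
  friend_count_mean   = as.vector(fc_mean),
  friend_count_median = as.vector(fc_median),
  n                   = as.vector(n_users)
)
print(toy_by_age)
```

The skewed age-13 group shows why the mean and median are both worth keeping: one large value pulls the mean well above the median.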


Overlaying Summaries with Raw Data

Notes:

ggplot(data=pf,aes(x=age,y=friend_count))+
  geom_point(alpha=1/20,color="orange")+
  geom_line(stat="summary",fun.y=mean)+
  geom_line(stat="summary",fun.y=quantile,fun.args=list(probs=.1),
            linetype=2,color="blue")+
  geom_line(stat="summary",fun.y=quantile,fun.args=list(probs=.5),
            color="blue")+
  geom_line(stat="summary",fun.y=quantile,fun.args=list(probs=.9),
            linetype=2,color="blue")+
  coord_cartesian(xlim=c(13,70),ylim=c(0,1000))

What are some of your observations of the plot?

Response:

Note: ggplot2 2.0.0 changed the syntax for passing parameters to functions when using stat = 'summary'. To set parameters on the function specified by fun.y, use the fun.args argument, e.g.:

ggplot( … ) + geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .9), … )

To zoom in, use a coord_cartesian(xlim = c(13, 90)) layer rather than an xlim(13, 90) layer.

Look up documentation for coord_cartesian() and quantile() if you’re unfamiliar with them.
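
A small numeric sketch of both functions, using invented friend counts. quantile() with probs of .1, .5 and .9 returns the kind of values traced by the dashed and solid blue lines. Dropping large values before summarising, which is what a limit set through xlim() or ylim() does, changes the summary, whereas coord_cartesian() only changes the visible window.

```r
# Invented friend counts for a single age
friend_count <- c(0, 50, 100, 200, 400, 800, 1600, 3200, 4900)

# 10th, 50th and 90th percentiles (default quantile type)
pct <- quantile(friend_count, probs = c(.1, .5, .9))
print(pct)

# coord_cartesian()-style zoom: the summary still uses every observation
mean_all <- mean(friend_count)

# ylim(0, 1000)-style limit: out-of-range rows are dropped first
mean_after_drop <- mean(friend_count[friend_count <= 1000])

print(c(mean_all = mean_all, mean_after_drop = mean_after_drop))
```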


Moira: Histogram Summary and Scatterplot

See the Instructor Notes of this video to download Moira’s paper on perceived audience size and to see the final plot.

Notes:


Correlation

Notes:

cor.test(pf$age,pf$friend_count,method="pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  pf$age and pf$friend_count
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03363072 -0.02118189
## sample estimates:
##         cor 
## -0.02740737
with(pf,cor.test(age,friend_count,method="pearson"))
## 
##  Pearson's product-moment correlation
## 
## data:  age and friend_count
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03363072 -0.02118189
## sample estimates:
##         cor 
## -0.02740737

Look up the documentation for the cor.test function.

What’s the correlation between age and friend count? Round to three decimal places. Response: -0.027
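
To make the cor.test() output less of a black box, the sketch below rebuilds the Pearson coefficient and its t statistic by hand on simulated data (x and y are made up), then applies the same t formula to the coefficient and degrees of freedom printed above.

```r
set.seed(1)
x <- rnorm(200)
y <- 0.3 * x + rnorm(200)

# Pearson r is the covariance scaled by both standard deviations
r_by_hand <- cov(x, y) / (sd(x) * sd(y))

# t statistic for H0: true correlation is 0, with n - 2 degrees of freedom
df <- length(x) - 2
t_by_hand <- r_by_hand * sqrt(df / (1 - r_by_hand^2))

test <- cor.test(x, y, method = "pearson")
print(c(r_by_hand = r_by_hand, r_cor_test = unname(test$estimate)))
print(c(t_by_hand = t_by_hand, t_cor_test = unname(test$statistic)))

# Same formula applied to the values reported for age and friend_count
r_pf <- -0.02740737
t_pf <- r_pf * sqrt(99001 / (1 - r_pf^2))
print(round(t_pf, 4))
```

The formula shows why such a small coefficient is still highly significant: with roughly 99,000 degrees of freedom, even a weak linear association produces a large t value.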


Correlation on Subsets

Notes:

with(subset(pf,age<=70), cor.test(age, friend_count,method='pearson'))
## 
##  Pearson's product-moment correlation
## 
## data:  age and friend_count
## t = -52.592, df = 91029, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1780220 -0.1654129
## sample estimates:
##        cor 
## -0.1717245
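
Why can restricting the age range strengthen the correlation? The sketch below uses made-up numbers and assumes the relationship is linear up to age 70 and that records above 70 do not follow it. Both are assumptions for illustration, not findings from pf.

```r
# Made-up data: friend count falls steadily to age 70, then no longer depends on age
age <- 13:100
friend_count <- ifelse(age <= 70, 400 - 4 * age, 300)
toy <- data.frame(age, friend_count)

cor_all    <- with(toy, cor(age, friend_count))
cor_subset <- with(subset(toy, age <= 70), cor(age, friend_count))

print(c(cor_all = cor_all, cor_subset = cor_subset))
```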

Correlation Methods

Notes: The Pearson correlation evaluates the linear relationship between two continuous variables. … The Spearman correlation coefficient is based on the ranked values for each variable rather than the raw data. Spearman correlation is often used to evaluate relationships involving ordinal variables. Pearson is far too sensitive to influential points/outliers for my taste, and while Spearman doesn’t suffer from this problem, I personally find Kendall easier to understand, interpret and explain than Spearman.

If, for example, one variable is the identity of a college basketball program and another variable is the identity of a college football program, one could test for a relationship between the poll rankings of the two types of program: do colleges with a higher-ranked basketball program tend to have a higher-ranked football program? A rank correlation coefficient can measure that relationship, and the measure of significance of the rank correlation coefficient can show whether the measured relationship is small enough to likely be a coincidence.

In short: Pearson suits roughly normal data with a linear relationship; Spearman is rank-based and robust to outliers.
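
A quick comparison of the three methods on invented data: the values increase monotonically, but the last point is an extreme outlier. The rank-based coefficients only see the ordering, so they are unaffected, while Pearson is pulled down.

```r
x <- 1:10
y <- c(1:9, 100)   # monotone, but the last point is an extreme outlier

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")
kendall  <- cor(x, y, method = "kendall")

print(round(c(pearson = pearson, spearman = spearman, kendall = kendall), 3))
```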

Create Scatterplots

Notes:

ggplot(data=pf,aes(x=www_likes_received,y=likes_received))+
  geom_point()+
  coord_trans(x='sqrt',y='sqrt')


Strong Correlations

Notes:

ggplot(data=pf,aes(x=www_likes_received,y=likes_received))+
  geom_point()+
  xlim(0,quantile(pf$www_likes_received,.95))+
  ylim(0,quantile(pf$likes_received,.95))+
  geom_smooth(method ='lm',color='red')
## Warning: Removed 4936 rows containing non-finite values (stat_smooth).
## Warning: Removed 4936 rows containing missing values (geom_point).
## Warning: Removed 31 rows containing missing values (geom_smooth).

What’s the correlation between the two variables? Include the top 5% of values for the variable in the calculation and round to 3 decimal places.

with(pf, cor.test(likes_received,www_likes_received,method='pearson'))
## 
##  Pearson's product-moment correlation
## 
## data:  likes_received and www_likes_received
## t = 937.1, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9473553 0.9486176
## sample estimates:
##       cor 
## 0.9479902

Response: 0.948
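
The plot above trims both axes at a 95th percentile. The sketch below, with invented like counts, shows how such a cutoff is computed and how many rows fall outside it. Rows outside the limits are the ones the warnings report as removed.

```r
# Invented like counts: many small values and a few very large ones
likes <- c(rep(0, 50), 1:45, 1000, 2000, 5000, 10000, 50000)

# 95th percentile used as the upper axis limit
cutoff <- quantile(likes, 0.95)

# Number of observations that would fall outside the limit
outside <- sum(likes > cutoff)

print(cutoff)
print(outside)
```

Because the cor.test() call runs on the full data frame, those trimmed rows still count toward the correlation even though they are not drawn.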


Moira on Correlation

Notes:


More Caution with Correlation

Notes:

# install.packages('alr3')
library(alr3)
## Loading required package: car
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
data(Mitchell)
?Mitchell

Create your plot!

ggplot(data=Mitchell,aes(x=Month,y=Temp))+
  geom_point()


Noisy Scatterplots

  1. Take a guess for the correlation coefficient for the scatterplot. Guess: 0
  2. What is the actual correlation of the two variables? (Round to the thousandths place) Response: 0.057

with(Mitchell,cor.test(Month,Temp,method='pearson'))
## 
##  Pearson's product-moment correlation
## 
## data:  Month and Temp
## t = 0.81816, df = 202, p-value = 0.4142
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.08053637  0.19331562
## sample estimates:
##        cor 
## 0.05747063

Making Sense of Data

Notes:


A New Perspective

What do you notice? Response:

Watch the solution video and check out the Instructor Notes!

Notes:


Understanding Noise: Age to Age Months

Notes:

ggplot(data=Mitchell,aes(x=Month,y=Temp))+
  geom_point()+
  scale_x_continuous(breaks=seq(0,203,12))
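
The sketch below suggests why the correlation can be near zero even when the pattern is strong. It uses a synthetic seasonal series in place of Mitchell; the cosine shape and the amplitude of 10 are assumptions made for illustration. Taking the month index modulo 12 folds every year onto a single cycle.

```r
# Synthetic stand-in for the Mitchell data: 204 months of a yearly cycle
month <- 0:203
temp  <- -10 * cos(2 * pi * month / 12)

# Linear correlation with the month index is close to zero
r <- cor(month, temp)
print(round(r, 3))

# Folding months onto a single year exposes the seasonal swing
month_of_year <- month %% 12
seasonal_mean <- tapply(temp, month_of_year, mean)
print(round(seasonal_mean, 1))
```

A correlation coefficient measures linear association only, so a purely cyclical relationship can score close to zero.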


Age with Months Means

p1<-ggplot(data=subset(pf.fc_by_age,age<71),
           aes(x=age,y=friend_count_mean))+
  geom_line()
p1

pf<-pf%>%
  mutate(age_with_months=age+(1-dob_month/12))

# pf$age_with_months <- with(pf, age + (1 - dob_month / 12))
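
A worked example of the formula on three hypothetical users who share the same age in years. Under this formula a January birthday gives the largest fractional age and a December birthday the smallest, which presumably assumes the ages were measured at the end of a year. That reading is an interpretation, not something the notes state.

```r
# Hypothetical users: same age in years, different birth months
toy <- data.frame(age = c(36, 36, 36), dob_month = c(1, 6, 12))

# Same expression as in the lesson code
toy$age_with_months <- with(toy, age + (1 - dob_month / 12))
print(toy)
```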

Programming Assignment

pf.fc_by_age_months<- pf%>%
  group_by(age_with_months)%>%
  summarise(friend_count_mean=mean(friend_count),
            friend_count_median=median(friend_count),
            n=n())%>%
  arrange(age_with_months)

p2<-ggplot(data=subset(pf.fc_by_age_months,age_with_months<71),
       aes(x=age_with_months,y=friend_count_mean))+
  geom_line()
p2


Noise in Conditional Means

library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
grid.arrange(p2,p1,ncol=1)
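
One reason the monthly plot is noisier is that each mean is computed from fewer users. The rough sketch below assumes users are spread evenly over the twelve birth months and that friend counts have the same spread in every bin. Both are simplifying assumptions, and the standard deviation of 400 is an arbitrary placeholder. Under them, the standard error of a mean grows by a factor of sqrt(12) when a yearly bin is split into monthly bins.

```r
# Standard error of a mean for a bin containing n users
standard_error <- function(sd, n) sd / sqrt(n)

sd_friends <- 400   # placeholder spread, not estimated from pf
n_year     <- 5196  # users aged 18 in the summary table above
n_month    <- n_year / 12

se_year  <- standard_error(sd_friends, n_year)
se_month <- standard_error(sd_friends, n_month)

print(c(se_year = se_year, se_month = se_month, ratio = se_month / se_year))
```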


Smoothing Conditional Means

Notes:

p1<-p1+geom_smooth()
p2<-p2+geom_smooth()
p3<-ggplot(data=subset(pf,age<71),
           aes(x=round(age/5)*5,y=friend_count))+
  geom_line(stat='summary',fun.y=mean)

grid.arrange(p2,p1,p3,ncol=1)
## `geom_smooth()` using method = 'loess'
## `geom_smooth()` using method = 'loess'


Which Plot to Choose?

Notes: depends on the purpose - sometimes you need all of them!


Analyzing Two Variables

Reflection: Scatterplots reveal patterns that correlation tests can miss, jitter and alpha help with readability, and binning or smoothing can dramatically change the look of a graph.

